Hardware Accelerator For IM2COL Operation

ABSTRACT

The present disclosure advantageously provides a matrix expansion unit that includes an input data selector, a first register set, a second register set, and an output data selector. The input data selector is configured to receive first matrix data in a columnwise format. The first register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a first shift loop. The second register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a second shift loop. The output data selector is coupled to the first register set and the second register set, and is configured to output second matrix data in a rowwise format.

BACKGROUND

The present disclosure relates to computer systems. More particularly, the present disclosure relates to computer systems that include neural networks.

Artificial neural networks (ANNs), such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc., are a popular solution to a wide array of challenging classification, recognition and regression problems. However, many ANN models require a large number of calculations involving a large number of weights and activations, which presents a significant challenge with respect to access, storage and performance, particularly for mobile and other power or storage-constrained devices. An ANN hardware accelerator accelerates these calculations, such as, for example, convolution operations performed by CNNs.

Typically, native convolution operations are not performed by a CNN due to the complicated dataflow and expensive datapaths that are usually required. Instead, native convolution operations are converted into generic matrix multiplication (GEMM) operations, and then the GEMM operations are executed more efficiently by a central processing unit (CPU), specialized processor, hardware accelerator processing engine, etc., using optimized software libraries or specialized hardware. More particularly, an “IM2COL” software function is used to convert the filter (weight) matrix and the input feature map (IFM) matrix for each convolution operation into an expanded format that is compatible with a GEMM operation. The IM2COL versions of each filter (weight) matrix and each IFM matrix are generated and stored in memory, and then loaded from memory and processed by the GEMM operation.

Unfortunately, the IM2COL version of each IFM matrix is much larger than the original version because many elements of the original version are duplicated in the IM2COL version. Consequently, the performance of the GEMM-based convolution operation is low, and the memory requirement high, because many elements of each original IFM matrix must be stored in (and subsequently loaded from) memory multiple times due to the expanded format of the IM2COL version.
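The size increase can be illustrated with a short calculation. The following is a minimal sketch only, assuming unit stride and no padding (neither of which is stated above); the function name `im2col_expansion` is hypothetical.

```python
def im2col_expansion(height, width, kernel, channels=1):
    """Return (original_count, expanded_count) of IFM elements.

    Assumes unit stride and no padding, so a kernel x kernel filter
    produces (height - kernel + 1) x (width - kernel + 1) outputs.
    """
    out_h = height - kernel + 1
    out_w = width - kernel + 1
    original = height * width * channels
    # One column of kernel*kernel*channels elements per output element.
    expanded = (kernel * kernel * channels) * (out_h * out_w)
    return original, expanded


print(im2col_expansion(4, 4, 3))      # (16, 36)
print(im2col_expansion(6, 6, 3, 3))   # (108, 432)
```

Under these assumptions, a 6×6×3 IFM grows from 108 stored elements to 432, i.e., by a factor of four.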

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an ANN, in accordance with embodiments of the present disclosure.

FIG. 2 depicts a convolutional neural network (CNN), in accordance with embodiments of the present disclosure.

FIG. 3A depicts a convolutional layer calculation for a CNN, in accordance with an embodiment of the present disclosure.

FIGS. 3B and 3C depict a converted convolutional layer calculation for a CNN, in accordance with an embodiment of the present disclosure.

FIG. 4A depicts a convolutional layer calculation for a CNN, in accordance with another embodiment of the present disclosure.

FIG. 4B depicts a converted convolutional layer calculation for a CNN, in accordance with another embodiment of the present disclosure.

FIG. 5 depicts a block diagram of a system, in accordance with embodiments of the present disclosure.

FIG. 6 depicts a hardware accelerator, in accordance with embodiments of the present disclosure.

FIG. 7 depicts a portion of a hardware accelerator, in accordance with an embodiment of the present disclosure.

FIG. 8 depicts a data flow diagram for the portion of a hardware accelerator depicted in FIG. 7, in accordance with an embodiment of the present disclosure.

FIG. 9 depicts a matrix expansion unit, in accordance with an embodiment of the present disclosure.

FIG. 10A depicts a state diagram for a matrix expansion unit, in accordance with an embodiment of the present disclosure.

FIG. 10B depicts an output table for a matrix expansion unit, in accordance with an embodiment of the present disclosure.

FIGS. 11A to 11F depict state tables for a matrix expansion unit, in accordance with an embodiment of the present disclosure.

FIG. 12 depicts a flow diagram representing functionality for expanding a matrix, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION

Embodiments of the present disclosure will now be described with reference to the drawing figures, in which like reference numerals refer to like parts throughout.

Embodiments of the present disclosure advantageously provide a matrix expansion unit that efficiently implements the IM2COL software function in hardware. The matrix expansion unit is disposed inline between the memory and the CPU, specialized processor, hardware accelerator processing engine, etc., and converts the original version of the IFM matrix to an IM2COL version. The matrix expansion unit advantageously reduces the memory footprint to that of the native convolution operation, reduces the memory bandwidth required for data movement, which increases the power efficiency at the system level, and takes advantage of the compute regularity of matrix multiplication, which can be more readily optimized in hardware.

In one embodiment, a matrix expansion unit includes an input data selector, a first register set, a second register set, and an output data selector. The input data selector is configured to receive first matrix data in a columnwise format. The first register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a first shift loop. The second register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a second shift loop. The output data selector is coupled to the first register set and the second register set, and is configured to output second matrix data in a rowwise format.

An ANN models the relationships between input data or signals and output data or signals using a network of interconnected nodes that is trained through a learning process. The nodes are arranged into various layers, including, for example, an input layer, one or more hidden layers, and an output layer. The input layer receives input data, such as, for example, image data, and the output layer generates output data, such as, for example, a probability that the image data contains a known object. Each hidden layer provides at least a partial transformation of the input data to the output data. A DNN has multiple hidden layers in order to model complex, nonlinear relationships between input data and output data.

In a fully-connected, feedforward ANN, each node is connected to all of the nodes in the preceding layer, as well as to all of the nodes in the subsequent layer. For example, each input layer node is connected to each hidden layer node, each hidden layer node is connected to each input layer node and each output layer node, and each output layer node is connected to each hidden layer node. Additional hidden layers are similarly interconnected. Each connection has a weight value, and each node has an activation function, such as, for example, a linear function, a step function, a sigmoid function, a tanh function, a rectified linear unit (ReLU) function, etc., that determines the output of the node based on the weighted sum of the inputs to the node. The input data propagates from the input layer nodes, through respective connection weights to the hidden layer nodes, and then through respective connection weights to the output layer nodes.

More particularly, at each input node, input data is provided to the activation function for that node, and the output of the activation function is then provided as an input data value to each hidden layer node. At each hidden layer node, the input data value received from each input layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as an input data value to each output layer node. At each output layer node, the output data value received from each hidden layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as output data. Additional hidden layers may be similarly configured to process data.
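The per-node calculation described above may be sketched as follows. This is an illustrative example only; the function names `dense_forward` and `relu`, the weight layout and the numeric values are assumptions, and no bias term is included because none is described above.

```python
def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)


def dense_forward(inputs, weights, activation):
    """Compute the outputs of one fully-connected layer.

    weights[n][i] is the connection weight from input i to node n
    (an assumed layout for this sketch).
    """
    outputs = []
    for node_weights in weights:
        acc = 0.0
        for w, x in zip(node_weights, inputs):
            acc += w * x  # multiply-and-accumulate
        outputs.append(activation(acc))
    return outputs


# Three input values feeding two nodes (hypothetical values).
print(dense_forward([1.0, 2.0, 3.0],
                    [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]],
                    relu))  # [0.0, 6.0]
```

The same function may be applied layer by layer, with the outputs of one layer provided as the inputs to the next.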

FIG. 1 depicts ANN 10, in accordance with an embodiment of the present disclosure.

ANN 10 includes input layer 20, one or more hidden layers 30, 40, 50, etc., and output layer 60. Input layer 20 includes one or more input nodes 21, 22, 23, etc. Hidden layer 30 includes one or more hidden nodes 31, 32, 33, 34, 35, etc. Hidden layer 40 includes one or more hidden nodes 41, 42, 43, 44, 45, etc. Hidden layer 50 includes one or more hidden nodes 51, 52, 53, 54, 55, etc. Output layer 60 includes one or more output nodes 61, 62, etc. Generally, ANN 10 includes N hidden layers, input layer 20 includes “i” nodes, hidden layer 30 includes “j” nodes, hidden layer 40 includes “k” nodes, hidden layer 50 includes “m” nodes, and output layer 60 includes “o” nodes.

In one embodiment, N equals 3, i equals 3, j, k and m equal 5 and o equals 2 (depicted in FIG. 1). Input node 21 is coupled to hidden nodes 31 to 35, input node 22 is coupled to hidden nodes 31 to 35, and input node 23 is coupled to hidden nodes 31 to 35. Hidden node 31 is coupled to hidden nodes 41 to 45, hidden node 32 is coupled to hidden nodes 41 to 45, hidden node 33 is coupled to hidden nodes 41 to 45, hidden node 34 is coupled to hidden nodes 41 to 45, and hidden node 35 is coupled to hidden nodes 41 to 45. Hidden node 41 is coupled to hidden nodes 51 to 55, hidden node 42 is coupled to hidden nodes 51 to 55, hidden node 43 is coupled to hidden nodes 51 to 55, hidden node 44 is coupled to hidden nodes 51 to 55, and hidden node 45 is coupled to hidden nodes 51 to 55. Hidden node 51 is coupled to output nodes 61 and 62, hidden node 52 is coupled to output nodes 61 and 62, hidden node 53 is coupled to output nodes 61 and 62, hidden node 54 is coupled to output nodes 61 and 62, and hidden node 55 is coupled to output nodes 61 and 62.

Many other variations of input, hidden and output layers are clearly possible, including hidden layers that are locally-connected, rather than fully-connected, to one another.

Training an ANN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the ANN achieves a particular level of accuracy. One method is backpropagation, or backward propagation of errors, which iteratively and recursively determines a gradient descent with respect to the connection weights, and then adjusts the connection weights to improve the performance of the network.
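The weight adjustment step may be illustrated with a deliberately simplified sketch: a single linear node with one weight, trained by gradient descent on the squared prediction error. Backpropagation applies the same update to every connection weight by propagating the error gradient backward through the layers, which is not shown here. The function name, learning rate and step count are assumptions.

```python
def train_single_weight(x, target, weight=0.0, learning_rate=0.1, steps=20):
    """Gradient descent on squared error for one linear node: y = weight * x."""
    for _ in range(steps):
        prediction = weight * x
        error = prediction - target
        gradient = 2.0 * error * x          # d(error**2) / d(weight)
        weight -= learning_rate * gradient  # move against the gradient
    return weight


# With x = 1.0 and target = 2.0, the weight approaches 2.0.
print(train_single_weight(1.0, 2.0))
```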

A multi-layer perceptron (MLP) is a fully-connected ANN that has an input layer, an output layer and one or more hidden layers. MLPs may be used for natural language processing applications, such as machine translation, speech recognition, etc. Other ANNs include recurrent neural networks (RNNs), long short-term memories (LSTMs), sequence-to-sequence models that include an encoder RNN and a decoder RNN, shallow neural networks, etc.

A CNN is a variation of an MLP that may be used for classification or recognition applications, such as image recognition, speech recognition, etc. A CNN has an input layer, an output layer and multiple hidden layers including convolutional layers, pooling layers, normalization layers, fully-connected layers, etc. Each convolutional layer applies a sliding dot product or cross-correlation to an input volume, applies an activation function to the results, and then provides the activation or output volume to the next layer. Convolutional layers typically use the ReLU function as the activation function. In certain embodiments, the activation function is provided in a separate activation layer, such as, for example, a ReLU layer. A pooling layer reduces the dimensions of the output volume received from the preceding convolutional layer, and may calculate an average or a maximum over small clusters of data, such as, for example, 2×2 matrices. In certain embodiments, a convolutional layer and a pooling layer may form a single layer of a CNN. The fully-connected layers follow the convolutional and pooling layers, and include a flatten layer and a classification layer, followed by a normalization layer that includes a normalization function, such as the SoftMax function. The output layer follows the last fully-connected layer; in certain embodiments, the output layer may include the normalization function.

FIG. 2 depicts CNN 15, in accordance with an embodiment of the present disclosure. CNN 15 includes input layer 20, one or more hidden layers, such as convolutional layer 30-1, pooling layer 30-2, hidden (flatten) layer 40, hidden (classification) layer 50, etc., and output layer 60. Many other variations of input, hidden and output layers are contemplated.

Input layer 20 includes one or more input nodes 21, etc., that present the input data, such as a color image, as an input volume to the first convolutional layer, e.g., convolutional layer 30-1. The input volume is a three-dimensional matrix that has a width, a height and a depth. For example, input data that represent a color image are presented as an input volume that is 512 pixels×512 pixels×3 channels (red, green, blue); other input volume dimensions may also be used, such as 32×32×3, 64×64×3, 128×128×3, etc., 32×32×1, 64×64×1, 128×128×1, 512×512×1, etc.

Convolutional layer 30-1 is locally-connected to input layer 20, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). For a CNN that uses a standard convolution, each node computes a dot product between the node's weights and the respective local region of the input volume. An activation function is then applied to the results of each convolution calculation to produce an output volume that is provided as an input volume to the subsequent layer. The activation function may be applied by each convolutional layer node or by the nodes of a subsequent locally-connected ReLU layer.

Pooling layer 30-2 is locally-connected to convolutional layer 30-1, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). Pooling layer 30-2 also produces an output volume that is provided as the input volume to the subsequent layer, such as, for example, another convolutional layer 30-1, a flatten layer 40, etc. In certain embodiments, convolutional layer 30-1 and pooling layer 30-2 form a single hidden layer 30. Similarly, in certain embodiments, convolutional layer 30-1, a ReLU layer and pooling layer 30-2 form a single hidden layer 30. Generally, the output volumes of the convolutional and pooling layers may be described as feature maps, and one or more single hidden layers 30 form a feature learning portion of CNN 15.
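As an illustration of the dimension reduction performed by a pooling layer, the following minimal sketch computes a maximum over 2×2 clusters of a single feature map. It assumes non-overlapping windows and even feature map dimensions, which are choices made for this example only; the function name `max_pool_2x2` is hypothetical.

```python
def max_pool_2x2(feature_map):
    """Maximum over non-overlapping 2x2 windows.

    Assumes the feature map has an even number of rows and columns.
    """
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            row.append(max(feature_map[r][c], feature_map[r][c + 1],
                           feature_map[r + 1][c], feature_map[r + 1][c + 1]))
        pooled.append(row)
    return pooled


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(max_pool_2x2(fmap))  # [[6, 8], [14, 16]]
```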

Hidden layer 40 is a “flatten” layer that is locally-connected to pooling layer 30-2, and includes one or more hidden (flatten) nodes 41, 42, 43, 44, 45, etc. Hidden (flatten) layer 40 “flattens” the output volume produced by the preceding pooling layer 30-2 into a column vector, which is provided to the subsequent, fully-connected hidden layer 50.

Hidden layer 50 is a classification layer that is fully-connected to hidden (flatten) layer 40, and includes one or more hidden (classification) nodes 51, 52, 53, 54, 55, etc.

Output layer 60 includes one or more output nodes 61, 62, etc., and is fully-connected to hidden (classification) layer 50. Fully-connected output layer 60 receives the classification results output by hidden (classification) layer 50, and each node outputs a predicted class score. A normalization function, such as a Softmax function, may be applied to the predicted class scores by output layer 60, or, alternatively, by an additional layer interposed between hidden (classification) layer 50 and output layer 60.
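The normalization of predicted class scores may be sketched as follows. This is one common formulation of the Softmax function, shown for illustration only; subtracting the largest score before exponentiation is a numerical-stability choice made for this example and is not described above.

```python
import math


def softmax(scores):
    """Map class scores to probabilities that sum to one."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])  # hypothetical class scores
print(probabilities)
```

The largest class score always receives the largest probability, so the predicted class is unchanged by the normalization.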

Similar to ANNs, training a CNN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the CNN achieves a particular level of accuracy. As noted above, backpropagation may be used to iteratively and recursively determine a gradient descent with respect to the connection weights, and then adjust the connection weights to improve the performance of the network. Matrix multiplication operations, and, more particularly, multiply-and-accumulate (MAC) operations, are used extensively by CNNs, as well as other ANNs.

FIG. 3A depicts a convolutional layer calculation 70 for a CNN, in accordance with an embodiment of the present disclosure.

Filter 72 (3×3×3) includes weight matrix 72.1 (w¹), weight matrix 72.2 (w²), and weight matrix 72.3 (w³). Input feature maps 73 (6×6×3) include input data matrix 73.1, input data matrix 73.2 and input data matrix 73.3, and output feature map 74 (4×4×1) includes an output data matrix. Filter 72 is convolved with input feature maps 73 to produce output feature map 74. In this example, the output data matrix element o₁ is the sum of the dot products of filter 72.1 (w¹) and the upper left quadrant of input data matrix 73.1 (a¹ _(q1)), filter 72.2 (w²) and the upper left quadrant of input data matrix 73.2 (a² _(q1)), and filter 72.3 (w³) and the upper left quadrant of input data matrix 73.3 (a³ _(q1)).

More particularly, the dot product, i.e., the sum of the element by element multiplication, of filter 72.1 (w¹) and the upper left quadrant of input data matrix 73.1 (a¹ _(q1)) is equal to w¹ ₁×a¹ ₁+w¹ ₂×a¹ ₂+w¹ ₃×a¹ ₃+w¹ ₄×a¹ ₇+w¹ ₅×a¹ ₈+w¹ ₆×a¹ ₉+w¹ ₇×a¹ ₁₃+w¹ ₈×a¹ ₁₄+w¹ ₉×a¹ ₁₅. The dot products of filter 72.2 (w²) and the upper left quadrant of input data matrix 73.2 (a² _(q1)), and filter 72.3 (w³) and the upper left quadrant of input data matrix 73.3 (a³ _(q1)) are calculated in the same manner, i.e., the dot product of filter 72.2 (w²) and the upper left quadrant of input data matrix 73.2 (a² _(q1)) is equal to w² ₁×a² ₁+w² ₂×a² ₂+w² ₃×a² ₃+w² ₄×a² ₇+w² ₅×a² ₈+w² ₆×a² ₉+w² ₇×a² ₁₃+w² ₈×a² ₁₄+w² ₉×a² ₁₅, and the dot product of filter 72.3 (w³) and the upper left quadrant of input data matrix 73.3 (a³ _(q1)) is equal to w³ ₁×a³ ₁+w³ ₂×a³ ₂+w³ ₃×a³ ₃+w³ ₄×a³ ₇+w³ ₅×a³ ₈+w³ ₆×a³ ₉+w³ ₇×a³ ₁₃+w³ ₈×a³ ₁₄+w³ ₉×a³ ₁₅.

Output data matrix element o₂ is the sum of the dot products of filter 72.1 (w¹) and the next upper quadrant of input data matrix 73.1, filter 72.2 (w²) and the next upper quadrant of input data matrix 73.2, and filter 72.3 (w³) and the next upper quadrant of input data matrix 73.3. The “next” upper quadrant in each input data matrix 73.1, 73.2 and 73.3 has been shifted one column to the right relative to the first upper quadrant. More particularly, the dot product of filter 72.1 (w¹) and the next upper quadrant of input data matrix 73.1 is equal to w¹ ₁×a¹ ₂+w¹ ₂×a¹ ₃+w¹ ₃×a¹ ₄+w¹ ₄×a¹ ₈+w¹ ₅×a¹ ₉+w¹ ₆×a¹ ₁₀+w¹ ₇×a¹ ₁₄+w¹ ₈×a¹ ₁₅+w¹ ₉×a¹ ₁₆. The dot products of filter 72.2 (w²) and the next upper quadrant of input data matrix 73.2, and filter 72.3 (w³) and the next upper quadrant of input data matrix 73.3 are calculated in the same manner, i.e., the dot product of filter 72.2 (w²) and the next upper quadrant of input data matrix 73.2 is equal to w² ₁×a² ₂+w² ₂×a² ₃+w² ₃×a² ₄+w² ₄×a² ₈+w² ₅×a² ₉+w² ₆×a² ₁₀+w² ₇×a² ₁₄+w² ₈×a² ₁₅+w² ₉×a² ₁₆, and the dot product of filter 72.3 (w³) and the next upper quadrant of input data matrix 73.3 is equal to w³ ₁×a³ ₂+w³ ₂×a³ ₃+w³ ₃×a³ ₄+w³ ₄×a³ ₈+w³ ₅×a³ ₉+w³ ₆×a³ ₁₀+w³ ₇×a³ ₁₄+w³ ₈×a³ ₁₅+w³ ₉×a³ ₁₆.
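The calculation of o₁ and o₂ described above may be sketched as a direct (native) multi-channel convolution. The sketch assumes unit stride and no padding, consistent with the 6×6×3 input and 4×4 output of this example; the function name, the all-ones filter and the input values (element aₙ set to n in every channel) are hypothetical and chosen only to make the result easy to check.

```python
def conv2d_multichannel(ifms, filt):
    """Direct convolution: ifms[ch][row][col] with filt[ch][row][col].

    Unit stride, no padding. Each output element is the sum, over all
    channels, of the dot product of the filter and one local region.
    """
    k = len(filt[0])
    out_h = len(ifms[0]) - k + 1
    out_w = len(ifms[0][0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for ch in range(len(filt)):
                for i in range(k):
                    for j in range(k):
                        acc += filt[ch][i][j] * ifms[ch][r + i][c + j]
            out[r][c] = acc
    return out


# Element a_n holds the value n (1..36) in each of the three channels.
channel = [[6 * r + c + 1 for c in range(6)] for r in range(6)]
ifms = [channel, channel, channel]
filt = [[[1] * 3 for _ in range(3)] for _ in range(3)]  # all weights equal 1

ofm = conv2d_multichannel(ifms, filt)
print(ofm[0][0], ofm[0][1])  # 216 243
```

With all weights equal to one, o₁ is three times the sum of a₁, a₂, a₃, a₇, a₈, a₉, a₁₃, a₁₄ and a₁₅, and o₂ is three times the sum of the region shifted one column to the right.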

FIGS. 3B and 3C depict a converted convolutional layer calculation for a CNN, in accordance with an embodiment of the present disclosure.

In one embodiment, the convolutional layer calculations for CNNs executing on central processor units (CPUs), graphics processing units (GPUs), etc., may be converted into generic matrix multiplication (GEMM) operations, which may leverage GEMM-optimized software libraries, or, alternatively, which may be implemented in a dedicated hardware accelerator using a two-dimensional array of MAC units.

Convolution layer calculation 70 is converted into a GEMM operation by converting filter 72 into converted weight matrix 75 (1×27) and input feature maps 73 into converted input data matrix 76 (27×16). After multiplying converted weight matrix 75 and converted input data matrix 76, converted output data matrix 77 (1×16) is then reformed into output feature map 74 (4×4). For ease of illustration, converted input data matrix 76 (27×16) and converted output data matrix 77 (1×16) are depicted in transposed orientations (16×27 and 16×1, respectively) in FIG. 3B.

In this example, converted output data matrix element o₁ is the dot product of the first (i.e., only) row of converted weight matrix 75 and the first column of converted input data matrix 76. As shown in FIG. 3B, the converted weight matrix 75 includes filter 72.1 (w¹), filter 72.2 (w²), and filter 72.3 (w³), while the first column of converted input data matrix 76 includes the elements of the upper left quadrant of input data matrix 73.1 (a¹ _(q1)), the upper left quadrant of input data matrix 73.2 (a² _(q1)), and the upper left quadrant of input data matrix 73.3 (a³ _(q1)).

More particularly, the converted output data matrix element o₁ is equal to w¹ ₁×a¹ ₁+w¹ ₂×a¹ ₂+w¹ ₃×a¹ ₃+w¹ ₄×a¹ ₇+w¹ ₅×a¹ ₈+w¹ ₆×a¹ ₉+w¹ ₇×a¹ ₁₃+w¹ ₈×a¹ ₁₄+w¹ ₉×a¹ ₁₅+w² ₁×a² ₁+w² ₂×a² ₂+w² ₃×a² ₃+w² ₄×a² ₇+w² ₅×a² ₈+w² ₆×a² ₉+w² ₇×a² ₁₃+w² ₈×a² ₁₄+w² ₉×a² ₁₅+w³ ₁×a³ ₁+w³ ₂×a³ ₂+w³ ₃×a³ ₃+w³ ₄×a³ ₇+w³ ₅×a³ ₈+w³ ₆×a³ ₉+w³ ₇×a³ ₁₃+w³ ₈×a³ ₁₄+w³ ₉×a³ ₁₅. The converted output data matrix element o₁ is therefore equal to the output data matrix element o₁ of output feature map 74.
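The conversion into a GEMM operation may be sketched as follows: the input feature maps are expanded into a 27×16 matrix, and a single 1×27 row of weights is multiplied by it. The sketch assumes unit stride, no padding, windows flattened row by row within each channel, and output elements ordered left to right and then top to bottom; the function names and numeric values are hypothetical.

```python
def im2col_multichannel(ifms, k):
    """Expand ifms[ch][row][col] into a (k*k*channels) x (out_h*out_w) matrix.

    Each column holds one local region, flattened row by row within
    each channel, with the channels concatenated.
    """
    out_h = len(ifms[0]) - k + 1
    out_w = len(ifms[0][0]) - k + 1
    columns = []
    for r in range(out_h):
        for c in range(out_w):
            columns.append([ch[r + i][c + j]
                            for ch in ifms
                            for i in range(k)
                            for j in range(k)])
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]


def gemm_row(weights, matrix):
    """Multiply a 1 x N weight row by an N x M matrix, giving M outputs."""
    return [sum(w * row[col] for w, row in zip(weights, matrix))
            for col in range(len(matrix[0]))]


# Element a_n holds the value n (1..36) in each of the three channels.
channel = [[6 * r + c + 1 for c in range(6)] for r in range(6)]
matrix = im2col_multichannel([channel, channel, channel], 3)
outputs = gemm_row([1] * 27, matrix)  # all 27 weights equal 1

print(len(matrix), len(matrix[0]))  # 27 16
print(outputs[0], outputs[1])       # 216 243
```

The 16 outputs may then be reshaped into the 4×4 output feature map.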

FIG. 4A depicts a convolutional layer calculation 80 for a CNN, in accordance with an embodiment of the present disclosure.

Filter 82 (3×3) includes weight matrix (w), which is convolved with input feature map (IFM) 83 (4×4) to produce output feature map (OFM) 84 (2×2). In this example, the output data matrix element o₁ is the dot product of filter 82 (w) and the upper left quadrant of IFM 83 (a_(q1)).

More particularly, the dot product, i.e., the sum of the element by element multiplication, of filter 82 (w) and the first (i.e., upper left) quadrant of IFM 83 (a_(q1)) is equal to w₁×a₁+w₂×a₂+w₃×a₃+w₄×a₅+w₅×a₆+w₆×a₇+w₇×a₉+w₈×a₁₀+w₉×a₁₁. Output data matrix element o₃ is the dot product of filter 82 (w) and the second (i.e., lower left) quadrant of IFM 83. The second quadrant in IFM 83 has been shifted one row down relative to the first quadrant. More particularly, the dot product of filter 82 (w) and the second quadrant of IFM 83 is equal to w₁×a₅+w₂×a₆+w₃×a₇+w₄×a₉+w₅×a₁₀+w₆×a₁₁+w₇×a₁₃+w₈×a₁₄+w₉×a₁₅. Output data matrix element o₂ is the dot product of filter 82 (w) and the third (i.e., upper right) quadrant of IFM 83. The third quadrant in IFM 83 has been shifted one column right relative to the first quadrant. More particularly, the dot product of filter 82 (w) and the third quadrant of IFM 83 is equal to w₁×a₂+w₂×a₃+w₃×a₄+w₄×a₆+w₅×a₇+w₆×a₈+w₇×a₁₀+w₈×a₁₁+w₉×a₁₂. Output data matrix element o₄ is the dot product of filter 82 (w) and the fourth (i.e., lower right) quadrant of IFM 83. The fourth quadrant in IFM 83 has been shifted one row down relative to the third quadrant. More particularly, the dot product of filter 82 (w) and the fourth quadrant of IFM 83 is equal to w₁×a₆+w₂×a₇+w₃×a₈+w₄×a₁₀+w₅×a₁₁+w₆×a₁₂+w₇×a₁₄+w₈×a₁₅+w₉×a₁₆.

FIG. 4B depicts a converted convolutional layer calculation for a CNN, in accordance with an embodiment of the present disclosure.

Convolution layer calculation 80 is converted into a GEMM operation by converting filter 82 into converted weight matrix 85 (1×9) and IFM 83 into converted input data matrix 86 (9×4) using, for example, the IM2COL software function. After multiplying converted weight matrix 85 and converted input data matrix 86, converted output data matrix 87 (1×4) is then reformed into output feature map 84 (2×2).

In this example, converted output data matrix element o₁ is the dot product of the first (i.e., only) row of converted weight matrix 85 and the first column of converted input data matrix 86. As shown in FIG. 4B, the converted weight matrix 85 includes filter 82 (w), while the first column of converted input data matrix 86 includes the elements of the upper left quadrant of IFM 83 (a_(q1)). Many elements of IFM 83 are duplicated in converted input data matrix 86, i.e., elements a₂, a₃, a₅, a₈, a₉, a₁₂, a₁₄ and a₁₅ each appear twice, while elements a₆, a₇, a₁₀ and a₁₁ each appear four times.

More particularly, the element o₁ of converted output data matrix 87 is equal to w₁×a₁+w₄×a₅+w₇×a₉+w₂×a₂+w₅×a₆+w₈×a₁₀+w₃×a₃+w₆×a₇+w₉×a₁₁, which is the same as element o₁ of OFM 84 shown above. The element o₃ of converted output data matrix 87 is equal to w₁×a₅+w₄×a₉+w₇×a₁₃+w₂×a₆+w₅×a₁₀+w₈×a₁₄+w₃×a₇+w₆×a₁₁+w₉×a₁₅, which is the same as element o₃ of OFM 84 shown above. The element o₂ of converted output data matrix 87 is equal to w₁×a₂+w₄×a₆+w₇×a₁₀+w₂×a₃+w₅×a₇+w₈×a₁₁+w₃×a₄+w₆×a₈+w₉×a₁₂, which is the same as element o₂ of OFM 84 shown above. The element o₄ of converted output data matrix 87 is equal to w₁×a₆+w₄×a₁₀+w₇×a₁₄+w₂×a₇+w₅×a₁₁+w₈×a₁₅+w₃×a₈+w₆×a₁₂+w₉×a₁₆, which is the same as element o₄ of OFM 84 shown above.
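The term ordering above (w₁, w₄, w₇, w₂, ...) corresponds to flattening each 3×3 window column by column. The following minimal sketch reproduces that ordering for a 4×4 IFM whose element aₙ holds the value n, counts how often each element appears in the expanded data, and computes o₁ to o₄. The function name and the weight values are hypothetical, and unit stride with no padding is assumed.

```python
from collections import Counter


def im2col_columnwise(ifm, k):
    """Return one flattened window per output element.

    Each k x k window is flattened column by column, so the first
    window of a 4x4 IFM is a1, a5, a9, a2, a6, a10, a3, a7, a11.
    """
    out_h = len(ifm) - k + 1
    out_w = len(ifm[0]) - k + 1
    columns = []
    for r in range(out_h):
        for c in range(out_w):
            columns.append([ifm[r + i][c + j]
                            for j in range(k)
                            for i in range(k)])
    return columns


ifm = [[4 * r + c + 1 for c in range(4)] for r in range(4)]  # a_n = n
columns = im2col_columnwise(ifm, 3)  # order: o1, o2, o3, o4
counts = Counter(value for col in columns for value in col)

weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # w_n = n, hypothetical values
flat_w = [weights[i][j] for j in range(3) for i in range(3)]  # w1, w4, w7, ...
ofm = [sum(w * a for w, a in zip(flat_w, col)) for col in columns]

print(columns[0])                       # [1, 5, 9, 2, 6, 10, 3, 7, 11]
print(counts[1], counts[2], counts[6])  # 1 2 4
print(ofm)                              # [348, 393, 528, 573]
```

The 16 original elements occupy 36 positions after expansion, which is the duplication that an inline expansion avoids storing in memory.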

FIG. 5 depicts a block diagram of system 100, in accordance with embodiments of the present disclosure.

System 100 includes communication bus 110 coupled to one or more processors 120, memory 130, I/O interfaces 140, display interface 150, one or more communication interfaces 160, and one or more hardware accelerators (HAs) 170. Generally, I/O interfaces 140 are coupled to I/O devices 142 using a wired or wireless connection, display interface 150 is coupled to display 152, and communication interface 160 is connected to network 162 using a wired or wireless connection. In some embodiments, certain components of system 100 are implemented as a system-on-chip (SoC) 102; in other embodiments, system 100 may be hosted on a traditional printed circuit board, motherboard, etc.

In certain embodiments, system 100 is an embedded system in which one or more of the components depicted in FIG. 5 are not present, such as, for example, I/O interfaces 140, I/O devices 142, display interface 150, display 152, etc. Additionally, certain components, when present, such as, for example, HA 170, may be optimized based on various design constraints, such as, for example, power, area, etc.

Communication bus 110 is a communication system that transfers data between processor 120, memory 130, I/O interfaces 140, display interface 150, communication interface 160 and HAs 170, as well as other components not depicted in FIG. 5. Power connector 112 is coupled to communication bus 110 and a power supply (not shown). In certain embodiments, communication bus 110 is a network-on-chip (NoC).

Processor 120 includes one or more general-purpose or application-specific microprocessors that execute instructions to perform control, computation, input/output, etc. functions for system 100. Processor 120 may include a single integrated circuit, such as a micro-processing device, or multiple integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of processor 120. Additionally, processor 120 may include multiple processing cores, as depicted in FIG. 5. Generally, system 100 may include one or more processors 120, each containing one or more processing cores as well as various other modules.

In some embodiments, system 100 may include 2 processors 120, each containing multiple processing cores. For example, one processor 120 may be a high performance processor containing 4 “big” processing cores, e.g., Arm Cortex-A73, Cortex-A75, Cortex-A76, etc., while the other processor 120 may be a high efficiency processor containing 4 “little” processing cores, e.g., Arm Cortex-A53, Cortex-A55, etc. In this example, the “big” processing cores include a memory management unit (MMU). In other embodiments, system 100 may be an embedded system that includes a single processor 120 with one or more processing cores, such as, for example, an Arm Cortex-M core. In these embodiments, processor 120 typically includes a memory protection unit (MPU).

In many embodiments, processor 120 may also be configured to execute classification-based machine learning (ML) models, such as, for example, ANNs, DNNs, CNNs, RNNs, SVM, Naïve Bayes, etc. In these embodiments, processor 120 may provide the same functionality as a hardware accelerator, such as HA 170. For example, system 100 may be an embedded system that does not include HA 170.

In addition, processor 120 may execute computer programs or modules, such as operating system 132, software modules 134, etc., stored within memory 130. For example, software modules 134 may include an ML application, an ANN application, a DNN application, a CNN application, an RNN application, etc.

Generally, storage element or memory 130 stores instructions for execution by processor 120 and data. Memory 130 may include a variety of non-transitory computer-readable media that may be accessed by processor 120. In various embodiments, memory 130 may include volatile and nonvolatile media, non-removable media and/or removable media. For example, memory 130 may include any combination of random access memory (RAM), DRAM, SRAM, ROM, flash memory, cache memory, and/or any other type of non-transitory computer-readable medium.

Memory 130 contains various components for retrieving, presenting, modifying, and storing data. For example, memory 130 stores software modules that provide functionality when executed by processor 120. The software modules include operating system 132 that provides operating system functionality for system 100. Software modules 134 provide various functionality, such as image classification using CNNs, etc. Data 136 may include data associated with operating system 132, software modules 134, etc.

I/O interfaces 140 are configured to transmit and/or receive data from I/O devices 142. I/O interfaces 140 enable connectivity between processor 120 and I/O devices 142 by encoding data to be sent from processor 120 to I/O devices 142, and decoding data received from I/O devices 142 for processor 120. Generally, data may be sent over wired and/or wireless connections. For example, I/O interfaces 140 may include one or more wired communications interfaces, such as USB, Ethernet, etc., and/or one or more wireless communications interfaces, coupled to one or more antennas, such as WiFi, Bluetooth, cellular, etc.

Generally, I/O devices 142 provide input to system 100 and/or output from system 100. As discussed above, I/O devices 142 are operably connected to system 100 using a wired and/or wireless connection. I/O devices 142 may include a local processor coupled to a communication interface that is configured to communicate with system 100 using the wired and/or wireless connection. For example, I/O devices 142 may include a keyboard, mouse, touch pad, joystick, etc.

Display interface 150 is configured to transmit image data from system 100 to monitor or display 152.

Communication interface 160 is configured to transmit data to and from network 162 using one or more wired and/or wireless connections. Network 162 may include one or more local area networks, wide area networks, the Internet, etc., which may execute various network protocols, such as, for example, wired and/or wireless Ethernet, Bluetooth, etc. Network 162 may also include various combinations of wired and/or wireless physical layers, such as, for example, copper wire or coaxial cable networks, fiber optic networks, Bluetooth wireless networks, WiFi wireless networks, CDMA, FDMA and TDMA cellular wireless networks, etc.

HAs 170 are configured to execute ML models, such as, for example, ANNs, CNNs, RNNs, etc., in support of various applications embodied by software modules 134. Generally, HAs 170 include one or more processors, coprocessors, processing engines (PEs), compute engines (CEs), etc., such as, for example, CPUs, GPUs, NPUs (e.g., the ARM ML Processor), DSPs, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), controllers, microcontrollers, matrix multiplier circuits, MAC arrays, etc. HAs 170 also include a communications bus interface as well as non-volatile and/or volatile memories, such as, for example, ROM, flash memory, SRAM, DRAM, etc.

In many embodiments, HA 170 receives the ANN model and weights from memory 130 over communication bus 110 for storage in local volatile memory (e.g., SRAM, DRAM, etc.). In other embodiments, HA 170 receives a portion of the ANN model and weights from memory 130 over communication bus 110. In these embodiments, HA 170 determines the instructions needed to execute the ANN model or ANN model portion. In other embodiments, the ANN model (or ANN model portion) simply includes the instructions needed to execute the ANN model (or ANN model portion). In these embodiments, processor 120 determines the instructions needed to execute the ANN model, or, processor 120 divides the ANN model into ANN model portions, and then determines the instructions needed to execute each ANN model portion. The instructions are then provided to HA 170 as the ANN model or ANN model portion.

In further embodiments, HA 170 may store ANN models, instructions and weights in non-volatile memory. In certain embodiments, the ANN model may be directly implemented in hardware using PEs, CEs, matrix multiplier units, MAC arrays, etc. Generally, HA 170 receives input data from memory 130 over communication bus 110, and transmits output data to memory 130 over communication bus 110. In certain embodiments, the input data may be associated with a layer (or portion of a layer) of the ANN model, and the output data from that layer (or portion of that layer) may be transmitted to memory 130 over communication bus 110.

For example, the ARM ML Processor supports a variety of ANNs, including CNNs and RNNs, for classification, object detection, image enhancements, speech recognition and natural language understanding. The ARM ML Processor includes a control unit, a direct memory access (DMA) engine, local memory and 16 CEs. Each CE includes, inter alia, a MAC engine that performs convolution operations, a programmable layer engine (PLE), local SRAM, a weight decoder, a control unit, a direct memory access (DMA) engine, etc. Each MAC engine performs up to eight 16-wide dot products with accumulation. Generally, the PLE performs non-convolution operations, such as, for example, pooling operations, ReLU activations, etc. Each CE receives input feature maps (IFMs) and weight sets over the NoC and stores them in local SRAM. The MAC engine and PLE process the IFMs to generate the output feature maps (OFMs), which are also stored in local SRAM prior to transmission over the NoC.

FIG. 6 depicts a block diagram of hardware accelerator 170, in accordance with embodiments of the present disclosure.

HA 170 includes one or more controllers 172, communication bus interface 174, local memory 176 (e.g., SRAM, DRAM, etc.), one or more PEs 180, and one or more matrix expansion units (MEUs) 200. Controller 172 is coupled to communication bus interface 174, local memory 176, PEs 180 and MEU 200, and generally controls the components, functions, data flow, etc., of HA 170. In certain embodiments, controller 172 is an Arm Cortex-M33 microcontroller (MCU); other processors, microprocessors, controllers, microcontrollers, etc., are also contemplated.

Memory 176 is coupled to communication bus interface 174, PEs 180 and MEU 200, and receives IFM data and weights from memory 130 via communication bus interface 174, provides the IFM data (activations, “A”) to MEU 200, provides associated weights (“W”) to PEs 180, and receives OFM data (dot products, “O”) from PEs 180. Memory 176 may send the OFM data back to memory 130 via communication bus interface 174, after which new IFM data and associated weights are received from memory 130 via communication bus interface 174. Alternatively, memory 176 may send the OFM data back to MEU 200 as new IFM data (new activations, “A”), and may send new associated weights (“W”) to PEs 180.

MEU 200 is coupled to memory 176 and PEs 180, and converts or expands the original version of the IFM data (activations, “A”) to an IM2COL version, and provides the expanded IFM data (expanded activation sequences, “Aᵢ”) to PEs 180, as discussed in detail below.

Generally, each PE 180 may execute at least a portion of an ANN model using at least a portion of the ANN weights. In many embodiments, each PE 180 is a MAC unit that includes an 8-bit integer (INT8) multiplier and a 32-bit integer (INT32) accumulator register. Other types of PEs 180 are also contemplated. In many embodiments, a number of PEs 180 may be arranged as systolic array (SA) 190, such as, for example, 4 PEs, 8 PEs, 16 PEs (depicted in FIG. 6), 32 PEs, etc. In many embodiments, PEs 180 may be interconnected using registers that provide temporary storage for weight and/or activation operands as they cycle through SA 190. In certain embodiments, PEs 180 may be interconnected by a NoC using a ring topology, a star topology, a mesh topology, etc. In other embodiments, PEs 180 may be interconnected using a cross-bar switch, direct connections, etc.
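The MAC behavior of a single PE described above can be illustrated with a minimal Python sketch. The class name `MacPE`, the range check and the wrap-around overflow behavior are assumptions made for illustration only; the disclosure specifies only an INT8 multiplier and an INT32 accumulator register.

```python
class MacPE:
    """Hypothetical model of one PE: INT8 x INT8 multiply, INT32 accumulate."""

    def __init__(self):
        self.acc = 0  # accumulator register, zero after reset

    def reset(self):
        self.acc = 0

    def step(self, weight, activation):
        # Operands are assumed to be signed 8-bit integers.
        if not (-128 <= weight <= 127 and -128 <= activation <= 127):
            raise ValueError("operands must fit in INT8")
        total = self.acc + weight * activation
        # Wrap to the signed 32-bit range (overflow handling is an assumption).
        self.acc = (total + 2**31) % 2**32 - 2**31
        return self.acc
```

Calling `step` once per processing cycle and reading `acc` after the last cycle yields the dot product for that PE.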

FIG. 7 depicts a portion of hardware accelerator 170, in accordance with an embodiment of the present disclosure.

In this embodiment, SA 190 includes four PEs 180 (i.e., PE₁, PE₂, PE₃ and PE₄) coupled to memory 176 and MEU 200. Within SA 190, PE₃ is coupled to PE₁, and PE₄ is coupled to PE₂. In other embodiments, PE₁, PE₂, PE₃ and PE₄ may form a systolic subarray (SSA) 192 within SA 190. In these embodiments, SA 190 includes additional PEs 180, such as, for example, 4 additional PEs 180, 12 additional PEs 180, 24 additional PEs 180, etc., which may form additional SSAs 192; each additional SSA 192 may be coupled to an additional MEU 200. Generally, the number of PEs 180 within each SSA 192 and the number of expanded activation sequences (“Aᵢ”) that are provided by each MEU 200 depend on the underlying matrix dimensions of the IFM data and the weights. For example, the embodiment depicted in FIG. 7 reflects the matrix dimensions of filter 82 (3×3), IFM 83 (4×4) and OFM 84 (2×2) depicted in FIG. 4A.

With respect to the weights, PE₁ and PE₂ receive the sequence of weights (W), one weight per processing cycle. PE₃ receives the sequence of weights (W) from PE₁, one weight per processing cycle after the initial weight is delayed or skewed by one processing cycle. Similarly, PE₄ receives the sequence of weights (W) from PE₂, one weight per processing cycle after the initial weight is delayed or skewed by one processing cycle.

With respect to the activations, MEU 200 reads the elements of a sequence of activations (A) from memory 176, and generates four unique expanded activation sequences (A₁, A₂, A₃ and A₄), as described below. Each PE 180 receives one of the unique activation sequences (A₁, A₂, A₃ or A₄) from MEU 200. PE₁ receives the expanded activation sequence A₁, one expanded activation per processing cycle. PE₂ receives the expanded activation sequence A₂, one expanded activation per processing cycle. PE₃ receives the expanded activation sequence A₃, one expanded activation per processing cycle after the initial expanded activation is delayed or skewed by one processing cycle. PE₄ receives the expanded activation sequence A₄, one expanded activation per processing cycle after the initial expanded activation is delayed or skewed by one processing cycle.

Generally, PE₁, PE₂, PE₃ and PE₄ calculate and then output respective dot products O₁, O₂, O₃ and O₄ based on the weights (W) and the respective sequences of expanded activations (A₁, A₂, A₃ and A₄).

After the last weight (W) is received from memory 176 and the last element of expanded activation sequence (A₁) is received from MEU 200, PE₁ calculates the final dot product (O₁), and outputs the final dot product (O₁) to memory 176. Similarly, after the last weight (W) is received from memory 176 and the last element of expanded activation sequence (A₂) is received from MEU 200, PE₂ calculates the final dot product (O₂), and outputs the final dot product (O₂) to memory 176. After the last weight (W) is received from PE₁ and the last element of expanded activation sequence (A₃) is received from MEU 200, PE₃ calculates the final dot product (O₃), and outputs the final dot product (O₃) to memory 176, delayed or skewed by one processing cycle. Similarly, after the last weight (W) is received from PE₂ and the last element of expanded activation sequence (A₄) is received from MEU 200, PE₄ calculates the final dot product (O₄), and outputs the final dot product (O₄) to memory 176, delayed or skewed by one processing cycle. The process then repeats for the next sequences of weights (W), activations (A) and expanded activation sequences (A₁, A₂, A₃ and A₄).

FIG. 8 depicts data flow diagram 202 for the portion of a hardware accelerator depicted in FIG. 7, in accordance with an embodiment of the present disclosure.

Prior to processing data, PE₁, PE₂, PE₃ and PE₄ are reset or initialized, which includes setting the value of the accumulation registers to zero. Once data processing has commenced, PE₁ and PE₂ may be reset or initialized at the beginning of the first processing cycle, while PE₃ and PE₄ may be reset or initialized at the beginning of the second processing cycle.

After MEU 200 is reset, the first processing cycle begins. The sequence of weights provided by memory 176 (w₁ to w₉) is based on converted weight matrix 85 (one weight per cycle, read left to right), and the sequences of expanded activations provided by MEU 200 (A₁, A₂, A₃ and A₄) are based on converted input data matrix 86 (one row per cycle, each row read left to right), as depicted in FIG. 4B.

During the first processing cycle, memory 176 provides w₁ to PE₁ and PE₂, and MEU 200 provides element a₁ (from A₁) to PE₁, and element a₅ (from A₂) to PE₂. PE₁ multiplies w₁ and a₁ and then accumulates the result, while PE₂ multiplies w₁ and a₅ and then accumulates the result.

During the second processing cycle, memory 176 provides w₄ to PE₁ and PE₂, PE₁ provides w₁ to PE₃, PE₂ provides w₁ to PE₄, and MEU 200 provides element a₅ (from A₁) to PE₁, element a₉ (from A₂) to PE₂, element a₂ (from A₃) to PE₃, and element a₆ (from A₄) to PE₄. PE₁ multiplies w₄ and a₅ and then accumulates the result, PE₂ multiplies w₄ and a₉ and then accumulates the result, PE₃ multiplies w₁ and a₂ and then accumulates the result, and PE₄ multiplies w₁ and a₆ and then accumulates the result.

During the third processing cycle, memory 176 provides w₇ to PE₁ and PE₂, PE₁ provides w₄ to PE₃, PE₂ provides w₄ to PE₄, and MEU 200 provides element a₉ (from A₁) to PE₁, element a₁₃ (from A₂) to PE₂, element a₆ (from A₃) to PE₃, and element a₁₀ (from A₄) to PE₄. PE₁ multiplies w₇ and a₉ and then accumulates the result, PE₂ multiplies w₇ and a₁₃ and then accumulates the result, PE₃ multiplies w₄ and a₆ and then accumulates the result, and PE₄ multiplies w₄ and a₁₀ and then accumulates the result.

During the fourth processing cycle, memory 176 provides w₂ to PE₁ and PE₂, PE₁ provides w₇ to PE₃, PE₂ provides w₇ to PE₄, and MEU 200 provides element a₂ (from A₁) to PE₁, element a₆ (from A₂) to PE₂, element a₁₀ (from A₃) to PE₃, and element a₁₄ (from A₄) to PE₄. PE₁ multiplies w₂ and a₂ and then accumulates the result, PE₂ multiplies w₂ and a₆ and then accumulates the result, PE₃ multiplies w₇ and a₁₀ and then accumulates the result, and PE₄ multiplies w₇ and a₁₄ and then accumulates the result.

During the fifth processing cycle, memory 176 provides w₅ to PE₁ and PE₂, PE₁ provides w₂ to PE₃, PE₂ provides w₂ to PE₄, and MEU 200 provides element a₆ (from A₁) to PE₁, element a₁₀ (from A₂) to PE₂, element a₃ (from A₃) to PE₃, and element a₇ (from A₄) to PE₄. PE₁ multiplies w₅ and a₆ and then accumulates the result, PE₂ multiplies w₅ and a₁₀ and then accumulates the result, PE₃ multiplies w₂ and a₃ and then accumulates the result, and PE₄ multiplies w₂ and a₇ and then accumulates the result.

During the sixth processing cycle, memory 176 provides w₈ to PE₁ and PE₂, PE₁ provides w₅ to PE₃, PE₂ provides w₅ to PE₄, and MEU 200 provides element a₁₀ (from A₁) to PE₁, element a₁₄ (from A₂) to PE₂, element a₇ (from A₃) to PE₃, and element a₁₁ (from A₄) to PE₄. PE₁ multiplies w₈ and a₁₀ and then accumulates the result, PE₂ multiplies w₈ and a₁₄ and then accumulates the result, PE₃ multiplies w₅ and a₇ and then accumulates the result, and PE₄ multiplies w₅ and a₁₁ and then accumulates the result.

During the seventh processing cycle, memory 176 provides w₃ to PE₁ and PE₂, PE₁ provides w₈ to PE₃, PE₂ provides w₈ to PE₄, and MEU 200 provides element a₃ (from A₁) to PE₁, element a₇ (from A₂) to PE₂, element a₁₁ (from A₃) to PE₃, and element a₁₅ (from A₄) to PE₄. PE₁ multiplies w₃ and a₃ and then accumulates the result, PE₂ multiplies w₃ and a₇ and then accumulates the result, PE₃ multiplies w₈ and a₁₁ and then accumulates the result, and PE₄ multiplies w₈ and a₁₅ and then accumulates the result.

During the eighth processing cycle, memory 176 provides w₆ to PE₁ and PE₂, PE₁ provides w₃ to PE₃, PE₂ provides w₃ to PE₄, and MEU 200 provides element a₇ (from A₁) to PE₁, element a₁₁ (from A₂) to PE₂, element a₄ (from A₃) to PE₃, and element a₈ (from A₄) to PE₄. PE₁ multiplies w₆ and a₇ and then accumulates the result, PE₂ multiplies w₆ and a₁₁ and then accumulates the result, PE₃ multiplies w₃ and a₄ and then accumulates the result, and PE₄ multiplies w₃ and a₈ and then accumulates the result.

During the ninth processing cycle, memory 176 provides w₉ to PE₁ and PE₂, PE₁ provides w₆ to PE₃, PE₂ provides w₆ to PE₄, and MEU 200 provides element a₁₁ (from A₁) to PE₁, element a₁₅ (from A₂) to PE₂, element a₈ (from A₃) to PE₃, and element a₁₂ (from A₄) to PE₄. PE₁ multiplies w₉ and a₁₁ and then accumulates the result, PE₂ multiplies w₉ and a₁₅ and then accumulates the result, PE₃ multiplies w₆ and a₈ and then accumulates the result, and PE₄ multiplies w₆ and a₁₂ and then accumulates the result. At the end of the ninth processing cycle, PE₁ outputs the value in the accumulation register as dot product O₁, and PE₂ outputs the value in the accumulation register as dot product O₃.

During the next processing cycle (which may be the first processing cycle for the next set of weights and expanded activation sequences), PE₁ provides w₉ to PE₃, PE₂ provides w₉ to PE₄, and MEU 200 provides element a₁₂ (from A₃) to PE₃, and element a₁₆ (from A₄) to PE₄. PE₃ multiplies w₉ and a₁₂ and then accumulates the result, and PE₄ multiplies w₉ and a₁₆ and then accumulates the result. At the end of this processing cycle, PE₃ outputs the value in the accumulation register as dot product O₂, and PE₄ outputs the value in the accumulation register as dot product O₄.

In summary, dot product O₁ is equal to w₁×a₁+w₄×a₅+w₇×a₉+w₂×a₂+w₅×a₆+w₈×a₁₀+w₃×a₃+w₆×a₇+w₉×a₁₁, which is the same as element o₁ of converted output data matrix 87 and element o₁ of OFM 84 shown above. Dot product O₃ is equal to w₁×a₅+w₄×a₉+w₇×a₁₃+w₂×a₆+w₅×a₁₀+w₈×a₁₄+w₃×a₇+w₆×a₁₁+w₉×a₁₅, which is the same as element o₃ of converted output data matrix 87 and element o₃ of OFM 84 shown above. Dot product O₂ is equal to w₁×a₂+w₄×a₆+w₇×a₁₀+w₂×a₃+w₅×a₇+w₈×a₁₁+w₃×a₄+w₆×a₈+w₉×a₁₂, which is the same as element o₂ of converted output data matrix 87 and element o₂ of OFM 84 shown above. Dot product O₄ is equal to w₁×a₆+w₄×a₁₀+w₇×a₁₄+w₂×a₇+w₅×a₁₁+w₈×a₁₅+w₃×a₈+w₆×a₁₂+w₉×a₁₆, which is the same as element o₄ of converted output data matrix 87 and element o₄ of OFM 84 shown above.
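The ten-cycle data flow summarized above can be checked with a short Python sketch. It assumes a 4×4 IFM and a 3×3 filter held as row-major nested lists, and models the one-cycle skew of PE₃ and PE₄ by indexing the previous cycle's weight and activation. The function names are hypothetical and the code is a behavioral illustration, not a description of the hardware.

```python
def conv2d_valid(ifm, filt):
    """Reference result: slide the filter over the IFM without padding."""
    k = len(filt)
    m = len(ifm) - k + 1
    return [[sum(filt[i][j] * ifm[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(m)]
            for r in range(m)]


def window_sequence(ifm, r, c, k):
    """Flatten the k x k window at (r, c) column by column."""
    return [ifm[r + i][c + j] for j in range(k) for i in range(k)]


def systolic_outputs(ifm, filt):
    """Model PE1..PE4 over ten cycles; PE3 and PE4 lag by one cycle."""
    k = len(filt)
    # Weight order w1, w4, w7, w2, w5, w8, w3, w6, w9.
    weights = [filt[i][j] for j in range(k) for i in range(k)]
    a1 = window_sequence(ifm, 0, 0, k)
    a2 = window_sequence(ifm, 1, 0, k)
    a3 = window_sequence(ifm, 0, 1, k)
    a4 = window_sequence(ifm, 1, 1, k)
    acc = {"PE1": 0, "PE2": 0, "PE3": 0, "PE4": 0}
    for cycle in range(len(weights) + 1):
        if cycle < len(weights):
            acc["PE1"] += weights[cycle] * a1[cycle]
            acc["PE2"] += weights[cycle] * a2[cycle]
        if cycle >= 1:
            # PE3 and PE4 use the weight PE1 and PE2 used one cycle earlier.
            acc["PE3"] += weights[cycle - 1] * a3[cycle - 1]
            acc["PE4"] += weights[cycle - 1] * a4[cycle - 1]
    return acc
```

In this sketch PE₁ and PE₃ accumulate the top row of the 2×2 output (o₁ and o₂), while PE₂ and PE₄ accumulate the bottom row (o₃ and o₄), matching the dot products listed above.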

FIG. 9 depicts matrix expansion unit 200, in accordance with an embodiment of the present disclosure.

MEU 200 includes controller 210, input data selector 220, register set 230, register set 240, and output data selector 250.

Controller 210 generally controls the components, data flow, etc., of MEU 200 in response to control signals received from controller 172. Controller 210 is coupled to input data selector 220, register set 230, register set 240 and output data selector 250. With respect to register set 230, controller 210 is coupled to data selectors 230-M₁, 230-M₂, 230-M₃, 230-M₄ and registers 230-X₁, 230-X₂, 230-X₃, 230-X₄ (not shown for clarity). With respect to register set 240, controller 210 is coupled to data selectors 240-M₁, 240-M₂, 240-M₃, 240-M₄, and registers 240-Y₁, 240-Y₂, 240-Y₃, 240-Y₄ (not shown for clarity).

Input data selector 220 receives activation data from memory 176, and, based on a control signal from controller 210, sends the activation data to registers 230-X₁, 230-X₂, 230-X₃, 230-X₄, or registers 240-Y₁, 240-Y₂, 240-Y₃, 240-Y₄. The activation data are received in a columnwise format. Generally, input data selector 220 is a single input, multiple output switch such as, for example, a demultiplexer, etc. In this embodiment, input data selector 220 has one input coupled to memory 176, and eight outputs respectively coupled to a respective first input of data selectors 230-M₁, 230-M₂, 230-M₃, 230-M₄, 240-M₁, 240-M₂, 240-M₃ and 240-M₄.

Register set 230 generally includes a plurality of data selectors and a plurality of registers. Register set 230 is coupled to input data selector 220 and output data selector 250. In this embodiment, register set 230 includes four data selectors 230-M₁, 230-M₂, 230-M₃ and 230-M₄, and four registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄, arranged in a shift-loop. Generally, data selectors 230-M₁, 230-M₂, 230-M₃, 230-M₄, are a multiple input, single output switch such as, for example, a multiplexer, etc. In this embodiment, each data selector 230-M₁, 230-M₂, 230-M₃, 230-M₄ has two inputs and one output. Data selector 230-M₁ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 230-X₂, and an output coupled to register 230-X₁. Data selector 230-M₂ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 230-X₃, and an output coupled to register 230-X₂. Data selector 230-M₃ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 230-X₄, and an output coupled to register 230-X₃. Data selector 230-M₄ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 230-X₁, and an output coupled to register 230-X₄. Registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄ are coupled to respective inputs of output data selector 250.

Register set 240 generally includes a plurality of data selectors and a plurality of registers. Register set 240 is coupled to input data selector 220 and output data selector 250. In this embodiment, register set 240 includes four data selectors 240-M₁, 240-M₂, 240-M₃ and 240-M₄, and four registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄, arranged in a shift-loop. Generally, data selectors 240-M₁, 240-M₂, 240-M₃, 240-M₄, are a multiple input, single output switch such as, for example, a multiplexer, etc. In this embodiment, each data selector 240-M₁, 240-M₂, 240-M₃, 240-M₄ has two inputs and one output. Data selector 240-M₁ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 240-Y₂ and an output coupled to register 240-Y₁. Data selector 240-M₂ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 240-Y₃ and an output coupled to register 240-Y₂. Data selector 240-M₃ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 240-Y₄ and an output coupled to register 240-Y₃. Data selector 240-M₄ has a first input coupled to a respective output of input data selector 220, a second input coupled to register 240-Y₁ and an output coupled to register 240-Y₄. Registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ are coupled to respective inputs of output data selector 250.
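The shift-loop wiring of register sets 230 and 240 can be modeled behaviorally as follows. This is a minimal sketch: the class name `RegisterSet` and its methods are hypothetical, with `load` standing for a data selector choosing its first input (from input data selector 220) and `rotate` standing for all data selectors choosing their second inputs (from the neighboring registers).

```python
class RegisterSet:
    """Hypothetical model of a four-register shift loop with a 2:1 mux per register."""

    def __init__(self, size=4):
        self.regs = [0] * size  # reset state: every register holds zero

    def load(self, index, value):
        # Mux `index` selects its first input, fed by the input data selector.
        self.regs[index] = value

    def rotate(self):
        # Every mux selects its second input: register i takes the value of
        # register i+1, and the last register takes the value of the first.
        self.regs = self.regs[1:] + self.regs[:1]


x = RegisterSet()
for i, value in enumerate(["a1", "a5", "a9", "a13"]):
    x.load(i, value)
x.rotate()  # x.regs is now ["a5", "a9", "a13", "a1"]
```

Four rotations return the registers to their loaded contents, which is what allows a single loaded column to be reused across several processing cycles.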

Output data selector 250 receives expanded activation data from register set 230 and register set 240, and sends the expanded activation data to a plurality of PEs 180 based on a control signal from controller 210. The expanded activation data are sent in a rowwise format. Generally, output data selector 250 is a multiple input, multiple output switch such as, for example, a multiplexer, etc. In this embodiment, output data selector 250 has eight inputs respectively coupled to registers 230-X₁, 230-X₂, 230-X₃, 230-X₄, 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄, and four outputs respectively coupled to four PEs 180, i.e., PE₁, PE₂, PE₃ and PE₄.

Operation of MEU 200 will be explained with reference to FIGS. 10A, 10B and 11A to 11F.

FIG. 10A depicts state machine transition diagram 300 for matrix expansion unit 200, in accordance with an embodiment of the present disclosure.

State machine transition diagram 300 depicts nine unique states, i.e., S1, S2, S3, S4, S5, S6, S7, S8 and S9. Each state is associated with a particular processing cycle described above, i.e., S1 is associated with the first processing cycle, S2 is associated with the second processing cycle, etc., and each state only transitions to the succeeding state, i.e., S1 transitions to S2 at the completion of the first processing cycle, S2 transitions to S3 at the completion of the second processing cycle, etc. After a reset, S1 is entered and the first processing cycle begins. At the completion of the ninth processing cycle, S9 transitions to S1 without a reset in order to output the final expanded activations for expanded activation sequences A₃ and A₄.
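The cyclic transition behavior can be sketched in a few lines of Python. The names `STATES`, `next_state` and `run` are illustrative assumptions; the sketch only captures that each state advances to its successor and that S9 wraps to S1 without a reset.

```python
STATES = [f"S{i}" for i in range(1, 10)]


def next_state(state):
    """Return the successor state; S9 wraps around to S1."""
    return STATES[(STATES.index(state) + 1) % len(STATES)]


def run(cycles, start="S1"):
    """List the states visited over the given number of processing cycles."""
    visited, state = [], start
    for _ in range(cycles):
        visited.append(state)
        state = next_state(state)
    return visited
```

For example, `run(10)` visits S1 through S9 and then S1 again, the tenth cycle being the one in which the final elements of A₃ and A₄ are output.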

Generally, controller 210 manages the state transitions by sending control signals to change the input and/or output settings of the data selectors of MEU 200.

FIG. 10B depicts output table 310 for matrix expansion unit 200, in accordance with an embodiment of the present disclosure.

MEU output table 310 lists the registers 230-X₁, 230-X₂, 230-X₃, 230-X₄, 240-Y₁, 240-Y₂, 240-Y₃, 240-Y₄ that are selected by output data selector 250 to provide the output from MEU 200. Each state provides four register values, one value for each expanded activation sequence A₁, A₂, A₃ and A₄.

During operation in state S1, controller 210 sends a control signal to output data selector 250 to select register 230-X₁ to output a first element of A₁, select register 230-X₂ to output a first element of A₂, select register 240-Y₁ to output an element of A₃, and select register 240-Y₂ to output an element of A₄. Due to the delayed or skewed processing described above, the outputs provided by registers 240-Y₁ and 240-Y₂ will be either zero (after a reset) or the last elements of A₃ and A₄ from a preceding set of weights and expanded activations.

During operation in state S2, controller 210 sends a control signal to output data selector 250 to select register 230-X₁ to output the second element of A₁, select register 230-X₂ to output the second element of A₂, select register 240-Y₁ to output the first element of A₃, and select register 240-Y₂ to output the first element of A₄.

During operation in state S3, controller 210 sends a control signal to output data selector 250 to select register 230-X₁ to output the third element of A₁, select register 230-X₂ to output the third element of A₂, select register 240-Y₁ to output the second element of A₃, and select register 240-Y₂ to output the second element of A₄. The configuration of output data selector 250 is the same for states S1, S2 and S3.

During operation in state S4, controller 210 sends a control signal to output data selector 250 to select register 240-Y₃ to output the fourth element of A₁, select register 240-Y₄ to output the fourth element of A₂, select register 240-Y₁ to output the third element of A₃, and select register 240-Y₂ to output the third element of A₄.

During operation in state S5, controller 210 sends a control signal to output data selector 250 to select register 240-Y₃ to output the fifth element of A₁, select register 240-Y₄ to output the fifth element of A₂, select register 230-X₁ to output the fourth element of A₃, and select register 230-X₂ to output the fourth element of A₄.

During operation in state S6, controller 210 sends a control signal to output data selector 250 to select register 240-Y₃ to output the sixth element of A₁, select register 240-Y₄ to output the sixth element of A₂, select register 230-X₁ to output the fifth element of A₃, and select register 230-X₂ to output the fifth element of A₄. The configuration of output data selector 250 is the same for states S5 and S6.

During operation in state S7, controller 210 sends a control signal to output data selector 250 to select register 230-X₃ to output the seventh element of A₁, select register 230-X₄ to output the seventh element of A₂, select register 230-X₁ to output the sixth element of A₃, and select register 230-X₂ to output the sixth element of A₄.

During operation in state S8, controller 210 sends a control signal to output data selector 250 to select register 230-X₃ to output the eighth element of A₁, select register 230-X₄ to output the eighth element of A₂, select register 240-Y₁ to output the seventh element of A₃, and select register 240-Y₂ to output the seventh element of A₄.

During operation in state S9, controller 210 sends a control signal to output data selector 250 to select register 230-X₃ to output the ninth element of A₁, select register 230-X₄ to output the ninth element of A₂, select register 240-Y₁ to output the eighth element of A₃, and select register 240-Y₂ to output the eighth element of A₄. The configuration of output data selector 250 is the same for states S8 and S9.

During operation in state S1 following state S9 (i.e., the beginning of the next set of weights and expanded activations), the outputs provided by registers 240-Y₁ and 240-Y₂ will be the ninth elements of A₃ and A₄, respectively.
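The register selections described for states S1 to S9 can be collected into a lookup table, shown here as a Python sketch. The dictionary layout, the short register labels (`X1` for 230-X₁, `Y3` for 240-Y₃, etc.) and the helper function are illustrative assumptions; the entries themselves are taken from the state descriptions above.

```python
# Registers selected for (A1, A2, A3, A4) in each state.
OUTPUT_TABLE = {
    "S1": ("X1", "X2", "Y1", "Y2"),
    "S2": ("X1", "X2", "Y1", "Y2"),
    "S3": ("X1", "X2", "Y1", "Y2"),
    "S4": ("Y3", "Y4", "Y1", "Y2"),
    "S5": ("Y3", "Y4", "X1", "X2"),
    "S6": ("Y3", "Y4", "X1", "X2"),
    "S7": ("X3", "X4", "X1", "X2"),
    "S8": ("X3", "X4", "Y1", "Y2"),
    "S9": ("X3", "X4", "Y1", "Y2"),
}


def distinct_configurations(table):
    """Return the distinct selector settings in order of first use."""
    seen = []
    for setting in table.values():
        if setting not in seen:
            seen.append(setting)
    return seen
```

Under this tabulation, output data selector 250 needs only five distinct settings across the nine states, and A₃ and A₄ are always taken from the first and second registers of one of the two register sets.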

FIGS. 11A to 11F depict state table 320 for matrix expansion unit 200, in accordance with an embodiment of the present disclosure.

MEU state table 320 depicts the activation values a₁ to a₁₆ stored in each register for states S1 to S9 based on the embodiments of IFM 83 and converted input data matrix 86 depicted in FIGS. 4A and 4B. States S1 to S9 are followed by states S1 and S2 of the next IFM data sequence (i.e., activation values b₁ to b₁₆).

Prior to entering state S1 for the first time, controller 210 performs a reset operation in response to a reset control signal received from controller 172. During the reset operation, controller 210 resets the values stored in registers 230-X₁, 230-X₂, 230-X₃, 230-X₄, 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ to zero. In some embodiments, controller 210 then transitions to state S1, while in other embodiments, controller 210 waits until a start or state transition control signal is received from controller 172 before transitioning to state S1.

FIG. 11A identifies the activation values loaded into each register during states S1, S2, S5 and S8, and the general operation of the shift loop for register set 230 and the shift loop for register set 240.

During operation in state S1, first column data of IFM 83 are loaded into registers 230-X₁, 230-X₂, 230-X₃, 230-X₄ of register set 230. During operation in state S2, second column data of IFM 83 are loaded into registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ of register set 240. During operation in state S5, third column data of IFM 83 are loaded into registers 230-X₁, 230-X₂, 230-X₃, 230-X₄ of register set 230. During operation in state S8, fourth column data of IFM 83 are loaded into registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ of register set 240. These data load operations are discussed in more detail below.

The shift loop for register set 230 operates during states S2, S3, S4, S6, S7, S8 and S9. During operation in each of these states, the values stored in registers 230-X₁, 230-X₂, 230-X₃, 230-X₄ are shifted or rotated by one register in a closed loop manner. More particularly, the value stored in register 230-X₄ is provided to and stored in register 230-X₃, the value stored in register 230-X₃ is provided to and stored in register 230-X₂, the value stored in register 230-X₂ is provided to and stored in register 230-X₁, and the value stored in register 230-X₁ is provided to and stored in register 230-X₄. These data shift operations are discussed in more detail below.

The shift loop for register set 240 operates during states S1, S3, S4, S5, S6, S7 and S9. During operation in each of these states, the values stored in registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ are shifted or rotated by one register in a closed loop manner. More particularly, the value stored in register 240-Y₄ is provided to and stored in register 240-Y₃, the value stored in register 240-Y₃ is provided to and stored in register 240-Y₂, the value stored in register 240-Y₂ is provided to and stored in register 240-Y₁, and the value stored in register 240-Y₁ is provided to and stored in register 240-Y₄. These data shift operations are discussed in more detail below.

FIG. 11B highlights the registers that output activation values for each expanded activation sequence A₁, A₂, A₃ and A₄ during each state S1 to S9. These registers are listed in MEU output table 310 (FIG. 10B).

FIG. 11C identifies the registers that output the activation values for expanded activation sequence A₁, i.e., a₁, a₅, a₉, a₂, a₆, a₁₀, a₃, a₇ and a₁₁, FIG. 11D identifies the registers that output the activation values for expanded activation sequence A₂, i.e., a₅, a₉, a₁₃, a₆, a₁₀, a₁₄, a₇, a₁₁ and a₁₅, FIG. 11E identifies the registers that output the activation values for expanded activation sequence A₃, i.e., a₂, a₆, a₁₀, a₃, a₇, a₁₁, a₄, a₈ and a₁₂, and FIG. 11F identifies the registers that output the activation values for expanded activation sequence A₄, i.e., a₆, a₁₀, a₁₄, a₇, a₁₁, a₁₅, a₈, a₁₂ and a₁₆.
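The load schedule, shift loops and output selections can be combined into a small behavioral simulation that reproduces the four expanded activation sequences listed above. This is a sketch under stated assumptions: the IFM is a 4×4 row-major nested list, each register set is a Python list, the register update (load or rotate) is applied before the outputs are read in each state, and zeros stand in for the next IFM's first column on the tenth cycle. All names are hypothetical.

```python
# Registers selected for A1..A4 per state, and the column loads per state.
TABLE = {
    "S1": "X1 X2 Y1 Y2", "S2": "X1 X2 Y1 Y2", "S3": "X1 X2 Y1 Y2",
    "S4": "Y3 Y4 Y1 Y2", "S5": "Y3 Y4 X1 X2", "S6": "Y3 Y4 X1 X2",
    "S7": "X3 X4 X1 X2", "S8": "X3 X4 Y1 Y2", "S9": "X3 X4 Y1 Y2",
}
LOADS = {"S1": ("X", 0), "S2": ("Y", 1), "S5": ("X", 2), "S8": ("Y", 3)}


def rotate(regs):
    """Register i takes the value of register i+1; the last takes the first."""
    return regs[1:] + regs[:1]


def expand(ifm):
    """Return the expanded activation sequences [A1, A2, A3, A4] for a 4x4 IFM."""
    columns = [[row[c] for row in ifm] for c in range(4)]
    regs = {"X": [0] * 4, "Y": [0] * 4}  # both register sets reset to zero
    seqs = [[], [], [], []]
    for cycle in range(10):  # S1..S9, then S1 of the next pass
        state = f"S{cycle % 9 + 1}"
        load = LOADS.get(state)
        for name in ("X", "Y"):
            if load and load[0] == name:
                # Zeros stand in for the next IFM on the tenth cycle.
                regs[name] = list(columns[load[1]]) if cycle < 9 else [0] * 4
            else:
                regs[name] = rotate(regs[name])
        picks = [regs[r[0]][int(r[1]) - 1] for r in TABLE[state].split()]
        for i, value in enumerate(picks):
            skewed = i >= 2  # A3 and A4 lag A1 and A2 by one cycle
            if (not skewed and cycle < 9) or (skewed and cycle >= 1):
                seqs[i].append(value)
    return seqs
```

With the IFM filled with the values 1 to 16 in row-major order, `expand` returns the index patterns of a₁ to a₁₆ given for A₁ to A₄ above, for example 1, 5, 9, 2, 6, 10, 3, 7, 11 for A₁.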

During operation in state S1, controller 210 cooperates with controller 172 to store first column data of IFM 83 into registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄ of register set 230. In one embodiment, controller 210 receives a control signal from controller 172 indicating that the first activation value (i.e., a₁) is available for storage in register 230-X₁, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 230-X₁, sends a control signal to data selector 230-M₁ to select the first input, and sends a control signal to register 230-X₁ to load or store the first activation value presented to its input. In certain embodiments, controller 210 sends a data ready signal to controller 172 after the first activation has been loaded; in other embodiments, controller 172 simply waits until a predetermined time period has elapsed.

The remaining registers are loaded in a similar manner. Controller 210 then receives a control signal from controller 172 indicating that the second activation value (i.e., a₅) is available for storage in register 230-X₂, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 230-X₂, sends a control signal to data selector 230-M₂ to select the first input, and sends a control signal to register 230-X₂ to load or store the second activation value presented to its input.

Controller 210 then receives a control signal from controller 172 indicating that the third activation value (i.e., a₉) is available for storage in register 230-X₃, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 230-X₃, sends a control signal to data selector 230-M₃ to select the first input, and sends a control signal to register 230-X₃ to load or store the third activation value presented to its input.

Controller 210 then receives a control signal from controller 172 indicating that the fourth activation value (i.e., a₁₃) is available for storage in register 230-X₄, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 230-X₄, sends a control signal to data selector 230-M₄ to select the first input, and sends a control signal to register 230-X₄ to load or store the fourth activation value presented to its input.

Also during operation in state S1, controller 210 operates the shift loop for register set 240 in which the values stored in registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ are shifted or rotated by one register in a closed loop manner. More particularly, controller 210 sends control signals to data selectors 240-M₁, 240-M₂, 240-M₃ and 240-M₄ to select their respective second inputs, sends control signals to registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ to latch their respective stored values onto their outputs, and then sends control signals to registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ to load or store the values presented to their respective inputs. In this embodiment, each register 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ has an output that latches the stored value so that a new value presented to its input may be stored at the same time.

Finally, during operation in state S1, controller 210 sends a control signal to output data selector 250 to select the registers to output the appropriate elements of each expanded activation sequence A₁, A₂, A₃ and A₄, as described above.

State S5 operates in the same manner as state S1, loading the third column data of IFM 83 into registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄ of register set 230, operating the shift loop of register set 240, and outputting the appropriate elements of each expanded activation sequence A₁, A₂, A₃ and A₄.

During operation in state S2, controller 210 cooperates with controller 172 to store the second column data of IFM 83 into registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ of register set 240. In one embodiment, controller 210 receives a control signal from controller 172 indicating that the first activation value (i.e., a₂) is available for storage in register 240-Y₁, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 240-Y₁, sends a control signal to data selector 240-M₁ to select the first input, and sends a control signal to register 240-Y₁ to load or store the first activation value presented to its input. In certain embodiments, controller 210 sends a data ready signal to controller 172 after the first activation value has been loaded; in other embodiments, controller 172 simply waits until a predetermined time period has elapsed.

The remaining registers are loaded in a similar manner. Controller 210 then receives a control signal from controller 172 indicating that the second activation value (i.e., a₆) is available for storage in register 240-Y₂, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 240-Y₂, sends a control signal to data selector 240-M₂ to select the first input, and sends a control signal to register 240-Y₂ to load or store the second activation value presented to its input.

Controller 210 then receives a control signal from controller 172 indicating that the third activation value (i.e., a₁₀) is available for storage in register 240-Y₃, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 240-Y₃, sends a control signal to data selector 240-M₃ to select the first input, and sends a control signal to register 240-Y₃ to load or store the third activation value presented to its input.

Controller 210 then receives a control signal from controller 172 indicating that the fourth activation value (i.e., a₁₄) is available for storage in register 240-Y₄, and, in response, sends a control signal to input data selector 220 to select the output line coupled to 240-Y₄, sends a control signal to data selector 240-M₄ to select the first input, and sends a control signal to register 240-Y₄ to load or store the fourth activation value presented to its input.

During operation in state S2, controller 210 also operates the shift loop for register set 230 in which the values stored in registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄ are shifted or rotated by one register in a closed loop manner. More particularly, controller 210 sends control signals to data selectors 230-M₁, 230-M₂, 230-M₃ and 230-M₄ to select their respective second inputs, sends control signals to registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄ to latch their respective stored values onto their outputs, and then sends control signals to registers 230-X₁, 230-X₂, 230-X₃ and 230-X₄ to load or store the values presented to their respective inputs. In this embodiment, each register 230-X₁, 230-X₂, 230-X₃ and 230-X₄ has an output that latches the stored value so that a new value presented to its input may be stored at the same time.

For example, at the end of operation in state S1, register 230-X₁ stores the value a₁, register 230-X₂ stores the value a₅, register 230-X₃ stores the value a₉, and register 230-X₄ stores the value a₁₃. During the shift loop operation for register set 230, the value a₁ is shifted or rotated from register 230-X₁ to register 230-X₄, the value a₅ is shifted or rotated from register 230-X₂ to register 230-X₁, the value a₉ is shifted or rotated from register 230-X₃ to register 230-X₂, and the value a₁₃ is shifted or rotated from register 230-X₄ to register 230-X₃.
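
The load and shift loop operations described above may be modeled behaviorally. The following Python sketch is a simplified illustration rather than a description of the circuit: the class name `RegisterSet` and its methods are hypothetical, control signals and latching are abstracted away, and a Python list stands in for the four registers.

```python
class RegisterSet:
    """Behavioral model of one register set: r data selectors feeding r registers."""

    def __init__(self, r=4):
        self.regs = [None] * r

    def load(self, position, value):
        # The data selector at `position` passes the value arriving from the
        # input data selector, and the associated register stores it.
        self.regs[position] = value

    def shift(self):
        # Each data selector passes the value held by a neighboring register,
        # so every value moves by one register and the first wraps to the last.
        self.regs = self.regs[1:] + self.regs[:1]


x = RegisterSet()
for pos, value in enumerate(["a1", "a5", "a9", "a13"]):
    x.load(pos, value)   # contents at the end of state S1
x.shift()                # shift loop operation for register set 230
print(x.regs)            # ['a5', 'a9', 'a13', 'a1']
```

The printed list corresponds to the example above: a₅ now occupies the first position, a₉ the second, a₁₃ the third, and a₁ has wrapped around to the fourth.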

Finally, during operation in state S2, controller 210 sends a control signal to output data selector 250 to select the registers to output the appropriate elements of each expanded activation sequence A₁, A₂, A₃ and A₄, as described above.

State S8 operates in the same manner as state S2, loading the fourth column data of IFM 83 into registers 240-Y₁, 240-Y₂, 240-Y₃ and 240-Y₄ of register set 240, operating the shift loop of register set 230, and outputting the appropriate elements of each expanded activation sequence A₁, A₂, A₃ and A₄.

During operation in states S3, S6 and S7, controller 210 operates the shift loop for register set 230 and register set 240, and sends a control signal to output data selector 250 to output the appropriate elements of each expanded activation sequence A₁, A₂, A₃ and A₄. No column data of IFM 83 are loaded into register set 230 or register set 240 during operation of these states.
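
The interplay of loading and shifting across states S1, S2 and S3 may be traced with a short simulation. The following Python sketch is illustrative only and rests on stated assumptions: each state performs exactly one rotation of a shift loop, a column load is treated as a single step, and output selection by output data selector 250 is not modeled. The helper `rotate` and the variable names are hypothetical.

```python
def rotate(regs):
    """Shift loop: move every value by one register, wrapping the first to the last."""
    return regs[1:] + regs[:1]


# First and second column data of the assumed 4x4 IFM.
column_1 = ["a1", "a5", "a9", "a13"]
column_2 = ["a2", "a6", "a10", "a14"]

set_230 = [None] * 4
set_240 = [None] * 4

# State S1: load register set 230, operate the shift loop of register set 240.
set_230 = list(column_1)
set_240 = rotate(set_240)

# State S2: load register set 240, operate the shift loop of register set 230.
set_240 = list(column_2)
set_230 = rotate(set_230)

# State S3: operate both shift loops; no column data are loaded.
set_230 = rotate(set_230)
set_240 = rotate(set_240)

print(set_230)  # ['a9', 'a13', 'a1', 'a5']
print(set_240)  # ['a6', 'a10', 'a14', 'a2']
```

Under these assumptions, both register sets hold complete columns after state S3, each rotated by a different amount, which is what allows the output data selector to draw elements of several expanded activation sequences from them.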

Other embodiments are also contemplated by the present disclosure.

FIG. 12 depicts a flow diagram 400 presenting functionality for expanding a matrix, in accordance with embodiments of the present disclosure.

At 410, an input data selector receives first matrix data in a columnwise format.

At 420, the input data selector provides the first matrix data to a first register set and a second register set. The first register set includes a plurality of data selectors and a plurality of registers arranged in a first shift loop. The second register set includes a plurality of data selectors and a plurality of registers arranged in a second shift loop.

At 430, an output data selector, coupled to the first register set and the second register set, outputs second matrix data in a rowwise format.
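
Steps 410 and 420 may be sketched in software as follows. This Python example is a minimal illustration under stated assumptions: the first matrix is held as a list of rows, columns are steered alternately to the two register sets (as in the example above, where the first and third columns go to register set 230 and the second and fourth to register set 240), and the function name `route_columns` is hypothetical.

```python
def route_columns(matrix):
    """Yield (register_set, column) pairs for a matrix stored as a list of rows.

    Columns are read one at a time (columnwise format) and steered to
    register set 0 and register set 1 in alternation.
    """
    n_cols = len(matrix[0])
    for c in range(n_cols):
        column = [row[c] for row in matrix]
        yield c % 2, column


ifm = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
routed = list(route_columns(ifm))
print(routed[0])  # (0, [1, 5, 9, 13])
```

Here register set 0 stands for the first register set and register set 1 for the second; the sketch covers only the routing and does not model the shift loops or the output data selector.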

Embodiments of the present disclosure advantageously provide a matrix expansion unit that efficiently implements the IM2COL software function in hardware. The matrix expansion unit is disposed inline between the memory and the CPU, specialized processor, hardware accelerator processing engine, etc., and converts the original version of the IFM matrix to an IM2COL version. The matrix expansion unit advantageously reduces the memory footprint to that of the native convolution operation, reduces the memory bandwidth required for data movement, which increases the power efficiency at the system level, and takes advantage of the compute regularity of matrix multiplication, which can be more readily optimized in hardware.
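
The memory footprint reduction can be made concrete with simple arithmetic. The following Python sketch assumes a 4×4 IFM, a 3×3 filter window, a stride of one and no padding, consistent with the four nine-element sequences discussed above; the function name `im2col_elements` is hypothetical.

```python
def im2col_elements(rows, cols, k):
    """Element count of the IM2COL version: one k*k sequence per window position."""
    windows = (rows - k + 1) * (cols - k + 1)
    return windows * k * k


native = 4 * 4                        # 16 activations in the original IFM
expanded = im2col_elements(4, 4, 3)   # 4 sequences of 9 activations = 36
ratio = expanded / native             # 2.25
print(native, expanded, ratio)        # 16 36 2.25
```

In this small example, storing the IM2COL version in memory would require 2.25 times as many elements as the original IFM; generating the expanded sequences inline avoids storing and moving those additional elements.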

The embodiments described above and summarized below are combinable.

In one embodiment, a hardware accelerator for an artificial neural network (ANN) includes a computing engine (CE) and a matrix expansion unit. The CE is coupled to a memory configured to store a first matrix. The matrix expansion unit is coupled to the memory and the CE, and includes an input data selector, a first register set, a second register set, and an output data selector. The input data selector is configured to receive, from the memory, first matrix data in a columnwise format. The first register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a first shift loop. The second register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a second shift loop. The output data selector is coupled to the first register set and the second register set, and is configured to output, to the CE, second matrix data in a rowwise format.

In another embodiment of the hardware accelerator, the hardware accelerator further includes a communication bus interface configured to receive at least a portion of an ANN model with ANN weights, and input data, and transmit output data; and a controller, coupled to the communication bus interface and the memory, configured to extract the first matrix data from the input data, where the memory is further configured to store the portion of the ANN model, the ANN weights, the input data and the output data, and the CE includes a plurality of multiply-and-accumulate (MAC) units.

In another embodiment of the hardware accelerator, the ANN model is a convolutional neural network (CNN) model that includes an input layer, at least one convolutional layer, a fully connected layer and an output layer.

In another embodiment of the hardware accelerator, the first register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the first shift loop and an output coupled to an associated register.

In another embodiment of the hardware accelerator, each register in the first shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the first shift loop.

In another embodiment of the hardware accelerator, the second register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the second shift loop and an output coupled to an associated register.

In another embodiment of the hardware accelerator, each register in the second shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the second shift loop.

In another embodiment of the hardware accelerator, the input data selector is configured to alternate outputting the first matrix data between the first register set and the second register set.

In another embodiment of the hardware accelerator, the first matrix data comprise one column of the first matrix, and the second matrix data comprise one row of a second matrix.

In a further embodiment, a matrix expansion unit includes an input data selector configured to receive first matrix data in a columnwise format; a first register set, coupled to the input data selector, including a plurality of data selectors and a plurality of registers arranged in a first shift loop; a second register set, coupled to the input data selector, including a plurality of data selectors and a plurality of registers arranged in a second shift loop; and an output data selector, coupled to the first register set and the second register set, configured to output second matrix data in a rowwise format.

In another embodiment of the matrix expansion unit, the first register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the first shift loop and an output coupled to an associated register.

In another embodiment of the matrix expansion unit, each register in the first shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the first shift loop.

In another embodiment of the matrix expansion unit, the second register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the second shift loop and an output coupled to an associated register.

In another embodiment of the matrix expansion unit, each register in the second shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the second shift loop.

In another embodiment of the matrix expansion unit, the input data selector is configured to alternate outputting the first matrix data between the first register set and the second register set.

In another embodiment of the matrix expansion unit, the input data selector is an input demultiplexor including an input and a plurality of outputs, the input configured to receive first matrix data; the output data selector is an output multiplexor including a plurality of inputs and a plurality of outputs; the first register set includes at least first, second, third and fourth multiplexors and at least first, second, third and fourth registers, where the first multiplexor has a first input coupled to a first output of the input demultiplexor, a second input coupled to the second register and an output coupled to the first register, the second multiplexor has a first input coupled to the first output of the input demultiplexor, a second input coupled to the third register and an output coupled to the second register, the third multiplexor has a first input coupled to the first output of the input demultiplexor, a second input coupled to the fourth register and an output coupled to the third register, the fourth multiplexor has a first input coupled to the first output of the input demultiplexor, a second input coupled to the first register and an output coupled to the fourth register, and the first, second, third and fourth registers are coupled to respective inputs of the output multiplexor; and the second register set includes at least first, second, third and fourth multiplexors and at least first, second, third and fourth registers, where the first multiplexor has a first input coupled to a second output of the input demultiplexor, a second input coupled to the second register and an output coupled to the first register, the second multiplexor has a first input coupled to the second output of the input demultiplexor, a second input coupled to the third register and an output coupled to the second register, the third multiplexor has a first input coupled to the second output of the input demultiplexor, a second input coupled to the fourth register and an output coupled to the third register, the fourth multiplexor has a first input coupled to the second output of the input demultiplexor, a second input coupled to the first register and an output coupled to the fourth register, and the first, second, third and fourth registers are coupled to respective inputs of the output multiplexor.

In another embodiment of the matrix expansion unit, the first matrix data comprise one column of a first matrix, and the second matrix data comprise one row of a second matrix.

In a further embodiment, a method for expanding a matrix using a matrix expansion unit includes receiving, at an input data selector, first matrix data in a columnwise format; providing, by the input data selector, the first matrix data to a first register set and a second register set, the first register set including a plurality of data selectors and a plurality of registers arranged in a first shift loop, and the second register set including a plurality of data selectors and a plurality of registers arranged in a second shift loop; and outputting, by an output data selector coupled to the first register set and the second register set, second matrix data in a rowwise format.

In another embodiment of the method, the first matrix data are periodically received; the first matrix data are alternatingly provided to the first register set and the second register set; and the second matrix data are continuously output.

In another embodiment of the method, the first matrix data comprise one column of a first matrix, and the second matrix data comprise one row of a second matrix.

While implementations of the disclosure are susceptible to embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the disclosure and not intended to limit the disclosure to the specific embodiments shown and described. In the description above, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Also, grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth. References to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the text.

Recitation of ranges of values herein are not intended to be limiting, referring instead individually to any and all values falling within the range, unless otherwise indicated, and each separate value within such a range is incorporated into the specification as if it were individually recited herein. The words “about,” “approximately,” or the like, when accompanying a numerical value, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for an intended purpose. Ranges of values and/or numeric values are provided herein as examples only, and do not constitute a limitation on the scope of the described embodiments. The use of any and all examples, or exemplary language (“e.g.,” “such as,” “for example,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any unclaimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.

In the following description, it is understood that terms such as “first,” “second,” “top,” “bottom,” “up,” “down,” “above,” “below,” and the like, are words of convenience and are not to be construed as limiting terms. Also, the terms apparatus, device, system, etc. may be used interchangeably in this text.

The many features and advantages of the disclosure are apparent from the detailed specification, and, thus, it is intended by the appended claims to cover all such features and advantages of the disclosure which fall within the scope of the disclosure. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and, accordingly, all suitable modifications and equivalents may be resorted to that fall within the scope of the disclosure. 

What is claimed is:
 1. A hardware accelerator for an artificial neural network (ANN), comprising: a computing engine (CE) coupled to a memory configured to store a first matrix; and a matrix expansion unit, coupled to the memory and the CE, including: an input data selector configured to receive, from the memory, first matrix data in a columnwise format, a first register set, coupled to the input data selector, including a plurality of data selectors and a plurality of registers arranged in a first shift loop, a second register set, coupled to the input data selector, including a plurality of data selectors and a plurality of registers arranged in a second shift loop, and an output data selector, coupled to the first register set and the second register set, configured to output, to the CE, second matrix data in a rowwise format.
 2. The hardware accelerator of claim 1, further comprising: a communication bus interface configured to: receive at least a portion of an ANN model with ANN weights, and input data, and transmit output data; and a controller, coupled to the communication bus interface and the memory, configured to extract the first matrix data from the input data, where: the memory is further configured to store the portion of the ANN model, the ANN weights, the input data and the output data, and the CE includes a plurality of multiply-and-accumulate (MAC) units.
 3. The hardware accelerator of claim 2, where the ANN model is a convolutional neural network (CNN) model that includes an input layer, at least one convolutional layer, a fully-connected layer and an output layer.
 4. The hardware accelerator of claim 1, where the first register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the first shift loop and an output coupled to an associated register.
 5. The hardware accelerator of claim 4, where each register in the first shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the first shift loop.
 6. The hardware accelerator of claim 5, where the second register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the second shift loop and an output coupled to an associated register.
 7. The hardware accelerator of claim 6, where each register in the second shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the second shift loop.
 8. The hardware accelerator of claim 1, where the input data selector is configured to alternate outputting the first matrix data between the first register set and the second register set.
 9. The hardware accelerator of claim 1, where the first matrix data comprise one column of the first matrix, and the second matrix data comprise one row of a second matrix.
 10. A matrix expansion unit, comprising: an input data selector configured to receive first matrix data in a columnwise format; a first register set, coupled to the input data selector, including a plurality of data selectors and a plurality of registers arranged in a first shift loop; a second register set, coupled to the data selector, including a plurality of data selectors and a plurality of registers arranged in a second shift loop; and an output data selector, coupled to the first register set and the second register set, configured to output second matrix data in a rowwise format.
 11. The matrix expansion unit of claim 10, where the first register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the first shift loop and an output coupled to an associated register.
 12. The matrix expansion unit of claim 11, where each register in the first shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the first shift loop.
 13. The matrix expansion unit of claim 12, where the second register set includes r data selectors and r registers, each data selector having a first input coupled to an output of the input data selector, a second input coupled to a preceding register in the second shift loop and an output coupled to an associated register.
 14. The matrix expansion unit of claim 13, where each register in the second shift loop is coupled to the output of the associated data selector and to the second input of a following data selector in the second shift loop.
 15. The matrix expansion unit of claim 10, where the input data selector is configured to alternate outputting the first matrix data between the first register set and the second register set.
 16. The matrix expansion unit of claim 10, where: the input data selector is an input demultiplexor including an input and a plurality of outputs, the input configured to receive first matrix data; the output data selector is an output multiplexor including a plurality of inputs and a plurality of outputs; the first register set includes at least first, second, third and fourth multiplexors and at least first, second, third and fourth registers, where: the first multiplexor has a first input coupled to a first output of the input demultiplexor, a second input coupled to the second register and an output coupled to the first register, the second multiplexor has a first input coupled to the first output of the input demultiplexor, a second input coupled to the third register and an output coupled to the second register, the third multiplexor has a first input coupled to the first output of the input demultiplexor, a second input coupled to the fourth register and an output coupled to the third register, the fourth multiplexor has a first input coupled to the first output of the input demultiplexor, a second input coupled to the first register and an output coupled to the fourth register, and the first, second, third and fourth registers are coupled to respective inputs of the output multiplexor; and the second register set includes at least first, second, third and fourth multiplexors and at least first, second, third and fourth registers, where: the first multiplexor has a first input coupled to a second output of the input demultiplexor, a second input coupled to the second register and an output coupled to the first register, the second multiplexor has a first input coupled to the second output of the input demultiplexor, a second input coupled to the third register and an output coupled to the second register, the third multiplexor has a first input coupled to the second output of the input demultiplexor, a second input coupled to the fourth register and an output coupled to the third register, the fourth multiplexor has a first input coupled to the second output of the input demultiplexor, a second input coupled to the first register and an output coupled to the fourth register, and the first, second, third and fourth registers are coupled to respective inputs of the output multiplexor.
 17. The matrix expansion unit of claim 10, where the first matrix data comprise one column of a first matrix, and the second matrix data comprise one row of a second matrix.
 18. A method for expanding a matrix using a matrix expansion unit, comprising: receiving, at an input data selector, first matrix data in a columnwise format; providing, by the input data selector, the first matrix data to a first register set and a second register set, the first register set including a plurality of data selectors and a plurality of registers arranged in a first shift loop, and the second register set including a plurality of data selectors and a plurality of registers arranged in a second shift loop; and outputting, by an output data selector coupled to the first register set and the second register set, second matrix data in a rowwise format.
 19. The method of claim 18, where: the first matrix data are periodically received; the first matrix data are alternatingly provided to the first register set and the second register set; and the second matrix data are continuously output.
 20. The method of claim 19, where the first matrix data comprise one column of a first matrix, and the second matrix data comprise one row of a second matrix. 